memory bandwidth
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Education (0.67)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Artificial Intelligence > Vision (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
Nrusimha, Aniruddha, Brandon, William, Mishra, Mayank, Shen, Yikang, Panda, Rameswar, Ragan-Kelley, Jonathan, Kim, Yoon
The size and compute characteristics of modern large language models have led to increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many use cases of interest, such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, which fuses the entire transformer forward pass into a single kernel to accelerate low-batch inference of large language models. Across various model sizes and quantization settings, FlashFormer achieves nontrivial speedups compared to existing inference kernels.
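To make the low-batch setting concrete, here is a back-of-envelope sketch (not from the paper; the parameter count, precision, and bandwidth figures are illustrative assumptions): at batch size 1, every decode step must stream all model weights from HBM, so memory bandwidth, not FLOPs, sets the latency floor, and per-layer kernel launch overheads become a visible fraction of each millisecond-scale step.

```python
# Rough latency floor for batch-1 decode: weight bytes / HBM bandwidth.
# All numbers below are illustrative assumptions, not measured results.

def decode_latency_floor_ms(n_params: float, bytes_per_param: float,
                            hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode latency from weight traffic alone."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (hbm_bandwidth_gb_s * 1e9) * 1e3

if __name__ == "__main__":
    # Hypothetical 8B-parameter model in FP16 on a GPU with ~3.3 TB/s of HBM bandwidth.
    floor = decode_latency_floor_ms(8e9, 2.0, 3300)
    print(f"per-token floor: {floor:.2f} ms")  # ~4.85 ms/token
```

Fusing the whole forward pass into one kernel targets exactly this regime: the compute units are mostly idle waiting on weight traffic, so removing per-layer launches and overlapping work matters more than raising peak FLOPS.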
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75\% memory savings and a $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on the NVIDIA RTX 4090 demonstrates that Low-Rank GEMM becomes the fastest approach for matrices $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.
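The core algorithmic idea can be sketched in a few lines. This is an illustrative NumPy version, not the paper's FP8 kernel; the `lowrank_gemm` helper, the rank-64 test matrix, and the exact-SVD decomposition are assumptions made for the demo (at scale a randomized SVD would be used, as the abstract notes).

```python
# Low-rank GEMM sketch: approximate A ~= U_k @ diag(s_k) @ V_k^T via truncated SVD,
# then compute A @ B as U_k @ (diag(s_k) @ (V_k^T @ B)),
# i.e. two thin GEMMs costing O(n^2 k) instead of one full O(n^3) GEMM.
import numpy as np

def lowrank_gemm(A: np.ndarray, B: np.ndarray, rank: int) -> np.ndarray:
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # exact SVD for clarity
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]        # keep the top-`rank` components
    return U @ (s[:, None] * (Vt @ B))

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 512))  # exactly rank 64
B = rng.standard_normal((512, 512))
err = np.linalg.norm(lowrank_gemm(A, B, rank=64) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error at rank 64: {err:.2e}")  # near machine precision for a rank-64 A
```

The memory angle follows the same logic: storing and streaming the factors instead of the full matrix is what yields the bandwidth and capacity savings the abstract reports.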
The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis
Large language models (LLMs) are rapidly being applied to radiology, enabling automated image interpretation and report generation tasks. Their deployment in clinical practice requires both high diagnostic accuracy and low inference latency, which in turn demands powerful hardware. High-performance graphics processing units (GPUs) provide the compute and memory throughput necessary to run large LLMs on imaging data. We review modern GPU architectures (e.g. NVIDIA A100/H100, AMD Instinct MI250X/MI300) and their key performance metrics: floating-point throughput, memory bandwidth, and VRAM capacity. We show how these hardware capabilities affect radiology tasks: for example, generating reports or detecting findings on CheXpert and MIMIC-CXR images is computationally intensive and benefits from GPU parallelism and tensor-core acceleration. Empirical studies indicate that using appropriate GPU resources can reduce inference time and improve throughput. We discuss practical challenges, including privacy, deployment, cost, and power, as well as optimization strategies such as mixed precision, quantization, compression, and multi-GPU scaling. Finally, we anticipate that next-generation features (8-bit tensor cores, enhanced interconnect) will further enable on-premise and federated radiology AI. Advancing GPU infrastructure is essential for safe, efficient LLM-based radiology diagnostics.
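As a concrete example of one optimization strategy the review discusses, here is a hedged PyTorch sketch of mixed-precision inference; the tiny `Sequential` model and input shape are placeholders, not a radiology pipeline or any model from the paper.

```python
# Mixed-precision inference sketch: run matmuls in half precision via autocast
# so tensor cores are used and memory traffic is roughly halved.
import torch

model = torch.nn.Sequential(                 # stand-in for an imaging/LLM backbone
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 4096, device=device)
model = model.to(device)

amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.inference_mode(), torch.autocast(device_type=device, dtype=amp_dtype):
    y = model(x)                             # linear layers execute in reduced precision
print(y.dtype, y.shape)
```

Quantization (e.g. 8-bit weights) pushes the same trade-off further, which is why the review treats it alongside mixed precision as a way to fit large models into limited VRAM.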
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Apple Just Upgraded the iPad Pro, MacBook Pro, and Vision Pro with Its New M5 Chip
The hardware largely remains the same, but performance gets a boost. Without much fanfare, Apple has unveiled three new flagship products today via a press release: no special event, no pre-recorded show. That might be because the new iPad Pro, MacBook Pro, and Vision Pro don't change the mold; they're identical to their predecessors. Internally, though, they debut Apple's highly anticipated M5 chip.
- Asia > Middle East > Palestine > Gaza Strip > Gaza Governorate > Gaza (0.05)
- North America > United States > California (0.05)
- Europe > Slovakia (0.05)
- Europe > Czechia (0.05)
- Information Technology (0.70)
- Transportation > Ground > Road (0.70)
- Information Technology > Communications > Mobile (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.48)
Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality or performance, if not both.
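For intuition about the attention pattern being described, here is a minimal 1-D sketch with assumed window and dilation parameters; it is an illustration of the pattern, not the paper's threadblock-level kernel.

```python
# 1-D neighborhood attention sketch: each query attends only to keys within
# `window` steps of itself, sampled every `dilation` positions.
import torch

def neighborhood_attention_1d(q, k, v, window: int = 3, dilation: int = 1):
    # q, k, v: (seq, dim). Build a banded/dilated mask over pairwise positions.
    n, d = q.shape
    idx = torch.arange(n)
    offset = idx[:, None] - idx[None, :]                      # signed distance between positions
    mask = (offset.abs() <= window * dilation) & (offset % dilation == 0)
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 8)
out = neighborhood_attention_1d(q, k, v, window=2, dilation=2)
print(out.shape)  # torch.Size([16, 8])
```

Note that this sketch still materializes the full n-by-n score matrix and then masks it away, which is exactly the inefficiency that motivates custom kernels for sliding-window patterns, especially in 2-D and 3-D.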
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Education (0.67)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
- Information Technology > Artificial Intelligence > Vision (0.68)
SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
Zhang, Hengrui, Patel, Pratyush, Ning, August, Wentzlaff, David
Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode phase. To efficiently serve LLMs, prior work proposes prefill-decode disaggregation to run each phase on separate hardware. However, existing hardware poorly matches the different requirements of each phase. Current datacenter GPUs and TPUs follow a more-is-better design philosophy that maximizes compute and memory resources, causing memory bandwidth underutilization in the prefill phase and compute underutilization in the decode phase. Such underutilization directly translates into increased serving costs. This paper proposes SPAD (Specialized Prefill and Decode hardware), adopting a less-is-more methodology to design specialized chips tailored to the distinct characteristics of prefill and decode phases. The proposed Prefill Chips have larger systolic arrays and use cost-effective GDDR memory, whereas the proposed Decode Chips retain high memory bandwidth but reduce compute capacity. Compared to modeled H100s, simulations show that the proposed Prefill Chips deliver 8% higher prefill performance on average at 52% lower hardware cost, while the proposed Decode Chips achieve 97% of the decode performance with 28% lower TDP. End-to-end simulations on production traces show that SPAD reduces hardware cost by 19%-41% and TDP by 2%-17% compared to modeled baseline clusters while offering the same performance. Even when models and workloads change, SPAD can reallocate either type of chip to run either phase and still achieve 11%-43% lower hardware costs, demonstrating the longevity of the SPAD design.
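The compute-bound versus memory-bound split can be made concrete with a short arithmetic-intensity estimate. The figures below are illustrative assumptions, not the paper's model: the "~300 FLOPs/byte" balance point is an assumed H100-class ratio of FP16 throughput to HBM bandwidth.

```python
# Arithmetic intensity (FLOPs per byte of weight traffic) for prefill vs. decode.
# Assumed numbers for illustration only.

def arithmetic_intensity(tokens_per_pass: int, n_params: float, bytes_per_param: float) -> float:
    flops = 2 * n_params * tokens_per_pass      # ~2 FLOPs per parameter per token
    bytes_moved = n_params * bytes_per_param    # weights streamed once per forward pass
    return flops / bytes_moved

prefill = arithmetic_intensity(tokens_per_pass=4096, n_params=70e9, bytes_per_param=2)
decode = arithmetic_intensity(tokens_per_pass=1, n_params=70e9, bytes_per_param=2)
print(f"prefill ~{prefill:.0f} FLOPs/byte, decode ~{decode:.0f} FLOPs/byte")
# A GPU that balances at roughly 300 FLOPs/byte saturates compute during prefill (~4096)
# but leaves most of it idle during decode (~1), motivating separate chip designs.
```

This is the asymmetry SPAD exploits: prefill chips can trade expensive HBM for cheaper GDDR and more compute, while decode chips keep the bandwidth and shed compute.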
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- North America > United States > Florida > Orange County > Orlando (0.04)
- (14 more...)
RecIS: Sparse to Dense, A Unified Training Framework for Recommendation Models
Zong, Hua, Zeng, Qingtao, Zhou, Zhengxiong, Han, Zhihua, Yan, Zhensong, Liu, Mingjie, Sun, Hechen, Liu, Jiawei, Hu, Yiwen, Wang, Qi, Xian, YiHan, Guo, Wenjie, Xiang, Houyuan, Zeng, Zhiyuan, Sheng, Xiangrong, Yan, Bencheng, Hu, Nan, Huang, Yuheng, Lian, Jinqing, Xu, Ziru, Zhang, Yan, Huang, Ju, Yang, Siran, Yi, Huimin, Wang, Jiamang, Wang, Pengjie, Zhu, Han, Wu, Jian, Ou, Dan, Xu, Jian, Tang, Haihong, Jiang, Yuning, Zheng, Bo, Qu, Lin
In this paper, we propose RecIS, a unified sparse-dense training framework designed to achieve two primary goals. 1. Unified Framework: provide a unified sparse-dense training framework, built on the PyTorch ecosystem, that meets the training needs of industrial-grade recommendation models integrated with large models. 2. System Optimization: optimize the sparse component to offer superior efficiency over TensorFlow-based recommendation models, while the dense component leverages existing optimization technologies within the PyTorch ecosystem. Currently, RecIS is used at Alibaba for numerous large-model-enhanced recommendation training tasks, and some traditional sparse models have also begun training on it.
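For readers unfamiliar with the terminology, here is a minimal PyTorch sketch of what a "sparse-dense" recommendation model looks like: a sparse part of large embedding tables indexed by categorical IDs feeding a dense MLP. The `SparseDenseRecModel` class and its sizes are assumptions for illustration, not the RecIS API.

```python
# Sparse-dense recommendation model sketch: sparse embedding lookups + dense MLP.
import torch
import torch.nn as nn

class SparseDenseRecModel(nn.Module):
    def __init__(self, n_ids: int = 100_000, emb_dim: int = 16, n_dense_feats: int = 8):
        super().__init__()
        # Sparse component: huge table, sparse gradients, bandwidth-bound lookups.
        self.embedding = nn.EmbeddingBag(n_ids, emb_dim, mode="sum", sparse=True)
        # Dense component: ordinary compute-bound layers handled by the PyTorch stack.
        self.mlp = nn.Sequential(nn.Linear(emb_dim + n_dense_feats, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, ids, offsets, dense_feats):
        sparse_out = self.embedding(ids, offsets)
        return self.mlp(torch.cat([sparse_out, dense_feats], dim=-1)).squeeze(-1)

model = SparseDenseRecModel()
ids = torch.tensor([3, 17, 42, 7])     # variable-length bags of categorical feature IDs
offsets = torch.tensor([0, 3])         # two samples: ids[0:3] and ids[3:]
score = model(ids, offsets, torch.randn(2, 8))
print(score.shape)                     # torch.Size([2])
```

The framework's stated contribution is making both halves efficient in one PyTorch-based system, rather than splitting them across TensorFlow-era sparse stacks and separate dense training code.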
- Information Technology (0.69)
- Education (0.48)
TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving
Wu, Bingyang, Zhang, Zili, Zhong, Yinmin, Huang, Guanzhe, Zhu, Yibo, Liu, Xuanzhe, Jin, Xin
Prefix caching is crucial to accelerate multi-turn interactions and requests with shared prefixes. At the cluster level, existing prefix caching systems are tightly coupled with request scheduling to optimize cache efficiency and computation performance together, leading to load imbalance, data redundancy, and memory fragmentation across instances. Memory pooling is a promising way to shield the scheduler from the underlying cache management so that it can focus on computation optimization. However, because existing prefix caching systems only transfer increasingly longer prefix caches between instances, they cannot achieve low-latency memory pooling. To address these problems, we propose TokenLake, a unified segment-level prefix cache pool. It uses a declarative cache interface to expose requests' query tensors, prefix caches, and cache-aware operations to TokenLake for efficient pooling. Powered by this abstraction, TokenLake can manage prefix caches at the segment level with a heavy-hitter-aware load balancing algorithm to achieve better cache load balance, deduplication, and defragmentation. TokenLake also transparently minimizes the communication volume of query tensors and new caches. Based on TokenLake, the scheduler can schedule requests elastically using existing techniques, without considering prefix cache management. Evaluations on real-world workloads show that TokenLake improves throughput by up to 2.6$\times$ and 2.0$\times$ and boosts hit rate by 2.0$\times$ and 2.1$\times$, compared to state-of-the-art cache-aware routing and cache-centric PD-disaggregation solutions, respectively.
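As a toy illustration of the general idea of segment-level caching with deduplication, here is a small content-addressed pool; the `SegmentPool` class, the segment size, and the string KV placeholders are assumptions for the sketch, not TokenLake's declarative interface.

```python
# Segment-level prefix cache sketch: split prefixes into fixed-size token segments
# and store them content-addressed, so identical segments shared across requests
# are kept only once in the pool.
from hashlib import blake2b

SEGMENT = 4  # tokens per segment (illustrative)

class SegmentPool:
    def __init__(self):
        self.segments = {}          # segment hash -> cached KV payload (placeholder)
        self.hits = self.misses = 0

    def lookup_or_insert(self, tokens):
        keys = []
        for i in range(0, len(tokens) - len(tokens) % SEGMENT, SEGMENT):
            seg = tuple(tokens[i:i + SEGMENT])
            h = blake2b(repr(seg).encode(), digest_size=8).hexdigest()
            if h in self.segments:
                self.hits += 1
            else:
                self.misses += 1
                self.segments[h] = f"kv-for-{seg}"   # stand-in for real KV tensors
            keys.append(h)
        return keys

pool = SegmentPool()
pool.lookup_or_insert([1, 2, 3, 4, 5, 6, 7, 8, 9])        # two full segments cached
pool.lookup_or_insert([1, 2, 3, 4, 5, 6, 7, 8, 42, 43])   # shared prefix segments deduplicated
print(pool.hits, pool.misses, len(pool.segments))          # 2 hits, 2 misses, 2 unique segments
```

Fixed-size segments are also what make pooling cheap to rebalance: a load balancer can move or replicate hot segments between instances without shipping ever-longer whole prefixes.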
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)